Unmanned aerial vehicles (UAVs) have gained significant popularity in 
recent years due to their ability to capture high-resolution aerial imagery for 
various applications, including traffic monitoring, urban planning, and 
disaster management. Accurate road and vehicle segmentation from UAV 
imagery plays a crucial role in these applications. In this paper, we propose a 
novel approach combining dual attention mechanisms and efficient multi- 
layer feature aggregation to enhance the performance of road and vehicle 
segmentation from UAV imagery. Our approach integrates a spatial attention 
mechanism and a channel-wise attention mechanism to enable the model to 
selectively focus on relevant features for segmentation tasks. In conjunction 
with these attention mechanisms, we introduce an efficient multi-layer 
feature aggregation method that synthesizes and integrates multi-scale 
features at different levels of the network, resulting in a more robust and 
informative feature representation. Our proposed method is evaluated on the 
UAVid semantic segmentation dataset, where it compares favorably with 
established approaches such as U-Net, DeepLabv3+, and SegNet. The 
experimental results show that our approach surpasses these methods in terms 
of segmentation accuracy. 
1. INTRODUCTION 

Road and vehicle segmentation from unmanned aerial vehicle (UAV) imagery is a critical component
of several applications, including traffic monitoring, autonomous navigation, urban
planning, and infrastructure management [1]-[4]. These segmentations facilitate enhanced comprehension 
and modeling of traffic trends, heightened situational awareness, and more strategic decision-making in the 
aforementioned domains. Given the rapid progression in UAV technology and the surge in the availability of 
high-resolution aerial visuals, the need for powerful, efficient segmentation methods capable of tackling 
intricate urban landscapes and yielding accurate results has become increasingly paramount [5]-[7]. 

Image segmentation tasks have witnessed remarkable achievements thanks to the advancements in 
deep learning-based approaches, specifically convolutional neural networks (CNNs). Several architectures, 
such as fully convolutional neural network (FCN) [8], U-Net [9], DeconvNet [10], and SegNet [11], have 
been proposed and exhibited impressive performance across diverse semantic segmentation tasks [12]-[17]. 
Based on the success of these architectures, various methods have been proposed to improve performance of 
road and vehicle segmentation tasks. Zhang et al. [18] present a deep learning approach for road extraction
using a modified U-Net architecture with residual connections. The method significantly improves road
segmentation performance in remote sensing imagery, as demonstrated by experimental results on multiple
datasets. Wan et al. [19] introduce DA-RoadNet, a deep learning structure that incorporates a dual attention
mechanism to improve road extraction performance. Their experimental results show that DA-RoadNet
achieves superior segmentation accuracy in high-resolution satellite imagery compared to existing
methods. Ren et al. [20] introduce a method that combines a capsule-based U-Net structure with dual
attention mechanisms. The method enhances the model's ability to capture both local and global contextual
information, resulting in improved road extraction performance in remote sensing imagery. Kestur et al. [21]
propose UFCN, a U-shaped FCN for road extraction in high-resolution RGB imagery obtained from UAVs.
The UFCN model efficiently processes large-scale aerial images, demonstrating improved road segmentation
performance compared to traditional methods. Varia et al. [22] present a deep CNN-based approach called
DeepExt for road extraction using high-resolution RGB imagery from UAVs. The model improves road
segmentation performance by leveraging the features of aerial images, outperforming traditional approaches
in road extraction tasks. Qian et al. [23] propose a deep CNN-based structure called DLT-Net that
simultaneously detects drivable areas, lane lines, and traffic objects in a single framework. The model
demonstrates improved efficiency and accuracy compared to separate detection methods, showcasing its
potential for real-time applications in intelligent transportation systems. Lo Bianco et al. [24] present a
method that combines the semantic segmentation of road objects and lanes within a single CNN framework.
This unified approach results in improved accuracy and efficiency compared to separate segmentation
methods, offering potential benefits for autonomous vehicle navigation and intelligent transportation
systems. Teichmann et al. [25] propose a method called MultiNet, which enables real-time simultaneous
semantic reasoning for autonomous driving applications. The model efficiently processes input data within a
single deep learning framework, simultaneously detecting objects, estimating drivable areas, and segmenting
lanes, thereby improving overall performance and reducing computational overhead.

While the aforementioned methods have indeed demonstrated noteworthy success in road and 
vehicle segmentation tasks, they often encounter challenges when applied to high-resolution UAV imagery. 
This type of imagery presents a diverse array of scales, orientations, and appearances of roads and vehicles 
that can confound conventional segmentation techniques. Firstly, the scale of objects in UAV imagery can 
vary dramatically. These wide-ranging scales can make it difficult for conventional segmentation methods to 
accurately differentiate between and identify roads and vehicles. Secondly, the orientation of roads and 
vehicles in UAV images can also pose a challenge. Roads can stretch in multiple directions, not just 
vertically or horizontally, and vehicles can be found in an assortment of orientations, depending on their 
direction of movement. This wide array of possible orientations can confuse conventional segmentation 
algorithms, leading to less accurate results. Moreover, roads and vehicles in UAV images can have a 
multitude of appearances due to differences in design, color, and lighting conditions. Appearance can also be 
influenced by factors such as the time of day or weather conditions, which can alter the visibility of the roads 
and vehicles. Lastly, complex urban scenes often come with their own set of challenges. Occlusions, such as 
one vehicle blocking the view of another, can make it difficult to accurately identify and segment each 
individual vehicle. Similarly, shadows can change the apparent shape and color of roads and vehicles, 
making them harder to segment correctly. Additionally, objects that look similar to roads or vehicles, such as 
rooftops or riverbanks, can confuse the segmentation task, leading to less accurate results.

In this paper, we propose a novel approach that addresses these challenges and improves segmentation
performance. Our method builds upon the strengths of the U-Net architecture while incorporating several
novel components designed to handle the unique challenges associated with UAV imagery. We designed an efficient multi-layer 
feature aggregation strategy that integrates both deep and shallow features in the decoder branch. The feature 
aggregation strategy incorporates both spatial and channel self-attention mechanisms, allowing for adaptive 
feature refinement at each spatial location while effectively utilizing spatial and channel information during 
aggregation. The primary contributions of our work can be summarized as follows: 

- We introduce a dual attention approach that effectively captures contextual information at both local and 
global scales, allowing the model to better distinguish roads, vehicles, and other objects in the scene. 

- Our method employs an efficient multi-layer feature aggregation strategy that integrates multi-scale 
features and enhances the model's ability to segment objects of varying sizes, shapes, and appearances. 

- We conduct extensive experiments using a publicly accessible dataset comprising high-resolution UAV 
images, demonstrating that our proposed method significantly outperforms renowned approaches such as 
U-Net, DeepLabv3+, and SegNet in terms of segmentation accuracy and efficiency. 

- We provide a thorough analysis of the results, highlighting the strengths and limitations of our method 
and suggesting avenues for future work. 

The paper is structured as follows: section 2 elaborates on the proposed method, providing a detailed 
explanation. Section 3 analyzes the experimental results and compares them with other approaches. Finally, 
section 4 concludes the paper, highlighting future research avenues. 
2. METHOD 
2.1. The overall architecture 

The overall pipeline of our model is illustrated in Figure 1. We employ an encoder-decoder 
architecture based on the U-Net [9] architecture for generating the output segmentation maps of roads and 
vehicles. U-Net has emerged as a widely adopted deep learning architecture for image segmentation tasks. 
U-Net consists of an encoder (contracting path) with successive layers of convolutional, ReLU activation, 
and max-pooling operations and a decoder (expanding path) with a combination of up-convolutional layers 
and skip connections from the corresponding layers in the encoder, giving it a symmetric U-shaped structure. 
The encoder captures contextual information, while the decoder combines the contextual information with 
spatial information to accurately segment the input image. We replaced the standard encoder part of U-Net 
with the ResNet-50 [26] architecture. This modification provides an improved feature extraction capability 
due to the deep and powerful ResNet-50 architecture. The decoder part remains the same as in the original 
U-Net, with up-convolutional layers and skip connections from the ResNet-50 encoder. 
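
As a rough illustration of the feature maps such an encoder makes available to the decoder, the sketch below traces the spatial size and channel count of the standard ResNet-50 stages for a 512x512 input patch. The strides and channel widths are those of the published ResNet-50 design; which stages actually feed the skip connections is an assumption here, since the text does not list them, and the function name is hypothetical.

```python
# Output stride and channel width of each standard ResNet-50 stage.
RESNET50_STAGES = [
    ("conv1", 2, 64),
    ("conv2_x", 4, 256),
    ("conv3_x", 8, 512),
    ("conv4_x", 16, 1024),
    ("conv5_x", 32, 2048),
]


def encoder_feature_shapes(height, width):
    """Return (stage, H, W, C) per stage, assuming sizes divisible by 32."""
    if height % 32 or width % 32:
        raise ValueError("height and width must be multiples of 32")
    return [(name, height // s, width // s, c) for name, s, c in RESNET50_STAGES]


shapes = encoder_feature_shapes(512, 512)
for name, h, w, c in shapes:
    print(f"{name}: {h}x{w}x{c}")
```

Under this assumption, the deepest encoder feature map for a 512x512 patch is 16x16 with 2048 channels, and each decoder step doubles the spatial size while merging the encoder feature map of matching resolution.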


[Figure 1 diagram. Labels: Input image; Attention Aggregation Module; Pooling operation; Upsampling operation; Output segmentation map]

Figure 1. The overall architecture of our approach 


To improve the segmentation results, we design an attention aggregation module that integrates a 
diverse range of deep and shallow features in the decoder branch. The attention aggregation module 
incorporates both spatial and channel self-attention mechanisms, allowing for adaptive feature refinement at 
each spatial location while effectively utilizing spatial and channel information during aggregation. This 
results in an efficiently generated aggregated feature with a rich representation. In particular, the spatial and 
channel self-attention mechanisms, which employ nonlinear operations, are applied to three distinct feature 
maps from the final decoder layers to create spatial and channel attention maps. These attention maps are 
then reassembled, considering the relationships between the input features, to produce the ultimate feature 
map. This enables the model to capture both global and local contextual information effectively. The 
subsequent subsections will provide detailed explanations of each module. 


2.2. Attention aggregation module 

In the process of extracting roads and vehicles from UAV imagery, enhancing the semantic 
information within deep features is crucial due to the significant scale disparities between road and vehicle 
targets. Liu et al. [27] demonstrated that semantic information gathered from successive convolution 
layers progressively coarsens, so that valuable details from earlier convolutional layers are lost in the 
final layer. Most semantic segmentation techniques use only the output of the decoder's last 
layer to create the segmentation maps, which reduces the completeness and 
precision of the final output. To fully leverage the semantic details present in each 
layer and enhance segmentation results, we have developed an attention aggregation module that employs 
both spatial and channel attention mechanisms to learn and refine the attention of each input feature, 
integrating them through nonlinear operations. The attention aggregation module combines the last three output 
layers of the decoder, i.e., $I_1$, $I_2$, $I_3$, where $I_1$ is the highest-resolution feature map and $I_3$ is the 
lowest-resolution feature map, as shown in Figure 2. For each input feature map, we first compute both spatial and 
channel attention maps, as described in section 2.3. Then, we use element-wise multiplication to 
combine each input feature map with its corresponding attention map to generate a semantically rich feature map. 
Finally, we perform hierarchical attentive fusion by fusing the multi-scale features as (1): 

$$F_{out} = \sum_{i=1}^{3} I_i^{cs} \odot I_i \qquad (1)$$

where $I_i^{cs}$ is the attention map generated for $I_i$ by the spatial and channel attention subnetwork, and $\odot$ 
represents element-wise multiplication with broadcasted unit dimensions. 
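
The fusion step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three feature maps are assumed to share the same channel count, the coarser maps are brought to the finest resolution by nearest-neighbour upsampling (the text does not say how resolutions are matched), and the attention maps are random stand-ins. The function names are hypothetical.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def attentive_fusion(features, attention_maps):
    """Sum attention-weighted feature maps at the finest resolution.

    features, attention_maps: lists of (C, H_i, W_i) arrays, finest first.
    Matching resolutions by nearest-neighbour upsampling is an assumption.
    """
    target_h = features[0].shape[1]
    fused = np.zeros_like(features[0], dtype=np.float64)
    for feat, attn in zip(features, attention_maps):
        refined = attn * feat                        # element-wise multiplication
        factor = target_h // refined.shape[1]
        fused += upsample_nearest(refined, factor)   # element-wise summation
    return fused


rng = np.random.default_rng(0)
feats = [rng.normal(size=(4, s, s)) for s in (16, 8, 4)]   # three decoder maps, finest first
attns = [rng.uniform(size=f.shape) for f in feats]         # stand-in attention maps
F_out = attentive_fusion(feats, attns)
print(F_out.shape)
```

In a real decoder the three layers usually differ in channel count, so a 1x1 convolution (or similar projection) would be needed before the summation; that step is omitted here.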


[Figure 2 diagram. Each of the three input feature maps passes through a Channel & Spatial Attention block; legend: element-wise multiplication, element-wise summation; result: output map]

Figure 2. The pipeline of the attention aggregation module 


Through the incorporation of the attention aggregation module, the resulting feature map becomes a 
confluence of features at various scales, encompassing information spanning from shallow to deep levels. As 
a result, the model gains increased flexibility to emphasize the aggregation of feature maps from different 
layers of the network, facilitating the acquisition of more semantic representations. 


2.3. Spatial and channel attention 

Due to the substantial scale disparities between road and vehicle targets encountered in road and 
vehicle segmentation tasks, it becomes crucial to direct attention towards different target objects within 
varying scale contexts. Taking inspiration from SCA-CNN [28] and CBAM [29], which use channel and 
spatial self-attention to perform adaptive feature refinement and improve performance in image 
classification, object detection, and image captioning tasks, we design a spatial and channel attention 
subnetwork to generate an attention map at each input layer of the attention aggregation module, as shown in 
Figure 3. For each input feature map $I_i \in \mathbb{R}^{C \times H \times W}$, we use two 3x3 convolution layers followed by a sigmoid 
activation to generate the spatial attention map $I_i^{s} \in \mathbb{R}^{1 \times H \times W}$. At the same time, we apply average pooling 
and max pooling, each followed by two 1x1 convolution layers, and a sigmoid function to get the channel-wise 
attention map $I_i^{c} \in \mathbb{R}^{C \times 1 \times 1}$. We then apply element-wise multiplication with equal weighting on these maps 
to generate the output attention map $I_i^{cs} \in \mathbb{R}^{C \times H \times W}$. Since the channel and spatial self-attention subnetwork is a 
lightweight module, it can effectively perform adaptive feature refinement without much additional 
computational overhead. 
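
A minimal NumPy sketch of such a subnetwork is given below to make the tensor shapes concrete. Several details are assumptions not stated in the text: a ReLU between the paired convolutions, a hidden width of 4 channels, weights shared between the average-pooled and max-pooled branches (as in CBAM), summation of the two pooled branches before the sigmoid, and random untrained weights. All function and parameter names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv3x3(x, w, b):
    """Same-padded 3x3 convolution. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            patch = xp[:, dy:dy + h, dx:dx + wd]
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], patch)
    return out + b[:, None, None]


def spatial_attention(x, p):
    hidden = np.maximum(conv3x3(x, p["w1"], p["b1"]), 0.0)  # ReLU is an assumption
    return sigmoid(conv3x3(hidden, p["w2"], p["b2"]))        # shape (1, H, W)


def channel_attention(x, p):
    def mlp(v):  # two 1x1 convs on a pooled (C,) vector reduce to matrix products
        return p["v2"] @ np.maximum(p["v1"] @ v, 0.0)
    avg = x.mean(axis=(1, 2))                                # global average pooling
    mx = x.max(axis=(1, 2))                                  # global max pooling
    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]        # shape (C, 1, 1)


def attention_map(x, p):
    # Broadcasting (1, H, W) * (C, 1, 1) gives a (C, H, W) attention map.
    return spatial_attention(x, p) * channel_attention(x, p)


def init_params(c, hidden, rng):
    return {
        "w1": rng.normal(scale=0.1, size=(hidden, c, 3, 3)), "b1": np.zeros(hidden),
        "w2": rng.normal(scale=0.1, size=(1, hidden, 3, 3)), "b2": np.zeros(1),
        "v1": rng.normal(scale=0.1, size=(hidden, c)),
        "v2": rng.normal(scale=0.1, size=(c, hidden)),
    }


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16, 16))          # one decoder feature map, C=8
params = init_params(8, 4, rng)
attn = attention_map(x, params)
refined = attn * x                        # adaptive feature refinement
print(attn.shape, refined.shape)
```

Because both branches end in a sigmoid, every entry of the combined attention map lies strictly between 0 and 1, so the refinement can only attenuate features, never amplify them.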


[Figure 3 diagram. Spatial attention branch: CxHxW input, 2x[3x3 conv], sigmoid, 1xHxW output. Channel attention branch: average pooling and max pooling, each followed by 2x[1x1 conv]. Legend: element-wise multiplication, element-wise summation]

Figure 3. The detailed architecture of the spatial and channel attention 
2.4. Loss function 

Most recent semantic segmentation approaches employ cross-entropy loss during the training 
process. This loss function primarily emphasizes the accuracy of pixel classification, assigning equal weight 
to pixels across different regions. However, there are large differences in the number of pixels between the 
segmentation targets and the background. Therefore, we employ the dice loss [30] along with cross-entropy 
loss for optimization. The total loss is given in (2): 

$$L_{total} = L_{ce} + L_{dice} \qquad (2)$$

where 

$$L_{ce}(y, \hat{y}) = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) \qquad (3)$$

and 

$$L_{dice} = 1 - \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i^2 + \sum_{i=1}^{N} \hat{y}_i^2} \qquad (4)$$

Here, $L_{ce}$ is the categorical cross-entropy loss, $L_{dice}$ is the dice loss, $\hat{y}_i$ refers to the predicted probability 
value associated with the $i$-th pixel, $y_i$ refers to the ground truth value of the $i$-th pixel, and $N$ refers to the 
overall pixel count of the image. The dice loss is capable of addressing the issue of class imbalance in input 
data. Therefore, the combined loss proves advantageous for segmenting a smaller foreground against a 
larger background, while the cross-entropy term promotes smooth training. 
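
The combined loss can be sketched in plain Python as below. This is an illustrative version under stated assumptions: the dice term uses the common form with a factor of 2 in the numerator and squared terms in the denominator, labels are flattened one-hot values, and the small `eps` constant for numerical stability is an addition that the text does not mention.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-7):
    """Categorical cross-entropy summed over flattened one-hot entries."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


def dice_loss(y_true, y_pred, eps=1e-7):
    """Dice loss with squared terms in the denominator."""
    inter = sum(t * p for t, p in zip(y_true, y_pred))
    denom = sum(t * t for t in y_true) + sum(p * p for p in y_pred)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def total_loss(y_true, y_pred):
    return cross_entropy(y_true, y_pred) + dice_loss(y_true, y_pred)


# Two pixels, three classes (background, road, car), flattened one-hot labels.
y_true = [1, 0, 0, 0, 1, 0]
y_pred = [0.8, 0.1, 0.1, 0.2, 0.7, 0.1]
print(cross_entropy(y_true, y_pred), dice_loss(y_true, y_pred), total_loss(y_true, y_pred))
```

For this toy input the cross-entropy term is -(ln 0.8 + ln 0.7) and the dice term is 1 - 3/3.2, and a perfect prediction drives both terms to zero.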


3. RESULTS AND DISCUSSION 

This section presents the experimental results of our proposed road and vehicle segmentation 
method from UAV imagery and compares its performance with three well-established methods: U-Net [9], 
DeepLabv3+ [31], and SegNet [11]. We evaluate the segmentation accuracy of each method using common 
evaluation metrics, including F1-score, intersection over union (IoU), and category mean pixel accuracy 
(MPA). The experiments were conducted on the UAVid semantic segmentation dataset [32], which is an 
openly accessible dataset comprising high-resolution UAV images of various urban and suburban scenes. 


3.1. Dataset and evaluation metrics 

We employ the UAVid semantic segmentation dataset to evaluate the proposed model. UAVid is a 
semantic segmentation dataset designed specifically for aerial imagery captured by UAVs or drones. The 
dataset focuses on urban scenes and aims to facilitate the development of deep learning models for aerial 
image understanding. The dataset comprises high-resolution aerial images captured at different altitudes, 
covering diverse urban scenarios. The images in the dataset have a high resolution of 3840x2160 pixels, which 
allows for the detailed analysis of urban scenes and the extraction of fine-grained features. Every image in the 
dataset is annotated with pixel-level semantic labels, offering an extensive collection of ground truth data for 
training and evaluating semantic segmentation models. The dataset contains multiple object classes, including 
buildings, roads, trees, vehicles, and pedestrians. Given the large size of the training set images, we initially 
extract 10,000 non-overlapping small patches of size 512x512 from the training set. Subsequently, we employ 
8,000 image patches from this set to train the proposed model. For the task of road and vehicle extraction, we 
only employ labels of three categories for each pixel: road, car, and background. 
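
As a small worked example of the patch extraction, the sketch below enumerates the top-left corners of non-overlapping 512x512 patches in a 3840x2160 image. Discarding the border remainder that does not fill a whole patch is an assumption; the text does not say how image borders were handled, and the function name is hypothetical.

```python
def patch_origins(image_w, image_h, patch=512):
    """Top-left corners of non-overlapping patches that fit fully inside the image."""
    return [
        (x, y)
        for y in range(0, image_h - patch + 1, patch)
        for x in range(0, image_w - patch + 1, patch)
    ]


origins = patch_origins(3840, 2160)
print(len(origins))  # 28 patches per image (7 columns x 4 rows)
```

Under this assumption each image yields 28 patches, leaving a 256-pixel strip on the right and a 112-pixel strip at the bottom unused.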

To assess the performance of semantic segmentation models on the UAVid dataset, we use several 
evaluation metrics, including F1-score, IoU, and category MPA. The F1-score is a balanced measure of 
segmentation performance, calculated as the harmonic mean of precision and recall. IoU quantifies the 
overlap between the predicted segmentation and the ground truth for an object class. MPA represents the 
percentage of accurately classified pixels in the image. 
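
The metrics can be computed from a confusion matrix, as in the plain-Python sketch below. The class indices (0 = background, 1 = road, 2 = car) and the toy label lists are illustrative. Computing MPA as the mean of per-class pixel accuracies is an assumption based on the name "category mean pixel accuracy"; the overall pixel accuracy, which matches the description in the text more literally, is included for comparison.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels with ground truth t predicted as p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_metrics(cm, k):
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp
    fn = sum(cm[k]) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"f1": f1, "iou": iou, "pixel_acc": recall}


def pixel_accuracy(cm):
    """Correctly classified pixels over all pixels."""
    return sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))


def mean_pixel_accuracy(cm):
    """Mean of per-class pixel accuracies over classes present in the ground truth."""
    accs = [per_class_metrics(cm, k)["pixel_acc"] for k in range(len(cm)) if sum(cm[k])]
    return sum(accs) / len(accs)


y_true = [0, 0, 1, 1, 1, 2, 2, 0]
y_pred = [0, 1, 1, 1, 0, 2, 0, 0]
cm = confusion_matrix(y_true, y_pred, 3)
road, car = per_class_metrics(cm, 1), per_class_metrics(cm, 2)
print(road, car, pixel_accuracy(cm), mean_pixel_accuracy(cm))
```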


3.2. Implementation details 

All methods were implemented using TensorFlow and trained on an NVIDIA RTX 4080 GPU. The 
models underwent training for 150 epochs using the Adam optimizer, with a learning rate of 0.0001 and a 
batch size of 16. Data augmentation techniques such as horizontal flipping, random cropping, and brightness 
adjustments were applied to prevent overfitting. 
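
The augmentations listed above can be sketched in NumPy as follows. The crop size, brightness range, and flip probability are illustrative values, not taken from the experiments, and the function name is hypothetical. The key point the sketch shows is that geometric transforms must be applied identically to the image and its label mask, while brightness changes apply to the image only.

```python
import numpy as np


def augment(image, mask, rng, crop=448, max_delta=0.2):
    """Random horizontal flip, random crop, and brightness shift.

    image: (H, W, 3) float array in [0, 1]; mask: (H, W) integer labels.
    crop and max_delta are illustrative values, not from the paper.
    """
    if rng.random() < 0.5:                       # horizontal flip, image and mask together
        image, mask = image[:, ::-1], mask[:, ::-1]
    h, w = mask.shape
    top = rng.integers(0, h - crop + 1)          # same crop window for image and mask
    left = rng.integers(0, w - crop + 1)
    image = image[top:top + crop, left:left + crop]
    mask = mask[top:top + crop, left:left + crop]
    delta = rng.uniform(-max_delta, max_delta)   # brightness shift, image only
    image = np.clip(image + delta, 0.0, 1.0)
    return image, mask


rng = np.random.default_rng(0)
img = rng.random((512, 512, 3))
msk = rng.integers(0, 3, size=(512, 512))
aug_img, aug_msk = augment(img, msk, rng)
print(aug_img.shape, aug_msk.shape)
```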


3.3. Experimental results 

Table 1 presents the experimental results of our proposed method alongside the comparative methods. 
For road segmentation, the proposed method achieves an IoU of 84.91%, which is higher than U-Net (82.68%), 
DeepLabv3+ (81.99%), and SegNet (78.93%). This indicates that our method is more effective at delineating 
road boundaries and identifying connected road networks. The F1-score for roads is also the highest for our 
method, at 90.46% compared to 90.06% for U-Net, 90.10% for DeepLabv3+, and 88.22% for SegNet. The higher 
F1-score implies that our method has a better balance between precision and recall in road segmentation tasks. 
In terms of vehicle segmentation, the proposed method achieves an IoU of 79.66%, outperforming U-Net 
(74.31%), DeepLabv3+ (77.08%), and SegNet (66.37%). This suggests that our method is more capable of 
accurately identifying individual vehicles, even in congested scenes or when vehicles are partially occluded. 
The F1-score for vehicles further supports this observation, with our method achieving a score of 87.10%, 
which is higher than U-Net (85.06%), DeepLabv3+ (87.06%), and SegNet (79.78%). This demonstrates that our 
proposed method maintains a better trade-off between precision and recall for vehicle segmentation tasks. 
The mean pixel accuracy of our proposed method is 93.89%, which is higher than U-Net (90.35%), DeepLabv3+ 
(93.23%), and SegNet (87.58%). This metric quantifies the proportion of correctly classified pixels in relation 
to the total number of pixels, and the higher value for our method indicates that it is more effective at 
assigning the correct class labels to individual pixels. This pixel-wise classification performance contributes 
to the overall improved segmentation results for both roads and vehicles. 
Overall, the quantitative results show that our proposed method outperforms U-Net, DeepLabv3+, and SegNet 
across all evaluation metrics. This performance can be attributed to the method's ability to capture both local 
and global contextual information, as well as the inclusion of an attention mechanism to enhance the model's 
capacity to focus on relevant regions. Consequently, our proposed method is better suited for road and vehicle 
segmentation tasks in high-resolution UAV imagery. 


Table 1. Segmentation results on the UAVid semantic segmentation dataset (all values in %) 

Method            F1-score (roads)  F1-score (vehicles)  IoU (roads)  IoU (vehicles)  MPA
U-Net [9]         90.06             85.06                82.68        74.31           90.35
DeepLabv3+ [31]   90.10             87.06                81.99        77.08           93.23
SegNet [11]       88.22             79.78                78.93        66.37           87.58
Our method        90.46             87.10                84.91        79.66           93.89


Figure 4 shows visual segmentation results of our proposed method and the comparative methods. 
Visual inspection of the segmentation results from input images (Figure 4(a)) further demonstrates the 
effectiveness of our proposed model, as illustrated in Figure 4(b). In comparison to DeepLabv3+ (Figure 4(c)) 
and U-Net (Figure 4(d)), the proposed model generates more accurate and coherent segmentations of roads 
and vehicles. The comparative methods tend to suffer from issues such as over-segmentation and 
misclassification, particularly in complex scenes with occlusions and shadows. In contrast, our method can 
better handle such challenges and produce more accurate and robust segmentation results. 


Figure 4. Segmentation results: (a) input image, (b) segmentation output of our model, (c) DeepLabv3+, 
and (d) U-Net 
4. CONCLUSION 

In this study, we proposed a novel deep learning methodology aimed at overcoming the 
complexities involved in the segmentation of roads and vehicles within high-resolution UAV imagery. Our 
strategy capitalized on the advantages offered by U-Net architectures, integrating them with an attention 
aggregation module designed to incorporate a diverse spectrum of deep and shallow features in the decoder 
section. Our method's unique dual attention mechanism and effective multi-layer feature aggregation scheme 
allowed for the successful capture of both local and global contextual data, thus enabling more accurate, 
consistent segmentation of roads and vehicles in intricate urban landscapes. This not only advances our 
understanding of the complex features in urban UAV imagery but also paves the way for further 
advancements in high-resolution image segmentation technologies. In our experiments, our approach 
outperformed established methods such as U-Net, DeepLabv3+, and SegNet in terms of 
segmentation accuracy and efficiency. This superior performance underscores the potential of our method as 
a more effective solution for urban scene analysis and interpretation, particularly in applications where 
precision and speed are paramount. The findings of this study have profound implications. By enabling more 
accurate segmentation, they improve our understanding of urban environments as seen from UAVs, which 
can be invaluable in fields such as urban planning, traffic management, and environmental monitoring. 
Looking ahead, we plan to extend our segmentation technique to include other object classes commonly seen 
in UAV imagery, such as buildings, pedestrians, and vegetation. This expansion aims to provide a more 
holistic understanding of urban landscapes, significantly broadening the potential applications of our method. 
Future research will further investigate the implications of these findings, shaping a new direction for the 
application of deep learning in image segmentation and urban analysis. 